ABSTRACT

Miniature inverted-repeat transposable elements 
(MITEs) are prevalent in eukaryotic species 
including plants. MITE families vary dramatically 
and usually cannot be identified based on 
homology. In this study, we de novo identified 
MITEs from 41 plant species, using computer 
programs MITE Digger, MITE-Hunter and/or 
Repetitive Sequence with Precise Boundaries 
(RSPB). MITEs were found in all but one species
(Cyanidioschyzon merolae). Combined
with the MITEs identified previously from the rice 
genome, >2.3 million sequences from 3527 MITE 
families were obtained from 41 plant species. In 
general, higher plants contain more MITEs than 
lower plants, with a few exceptions such as 
papaya, with only 538 elements. The largest 
number of MITEs is found in apple, with 237302 
MITE sequences. The number of MITE sequences 
in a genome is significantly correlated with 
genome size. A series of databases (plant MITE
databases, P-MITE), available online at
http://pmite.hzau.edu.cn/django/mite/, was constructed
to host all MITE sequences from the 41 plant 
genomes. The databases are available for 
sequence similarity searches (BLASTN), and MITE 
sequences can be downloaded by family or by 
genome. The databases can be used to study the
origin and amplification of MITEs, MITE-derived
small RNAs and the roles of MITEs in gene and
genome evolution.

INTRODUCTION 

Miniature inverted-repeat transposable elements (MITEs) 
are prevalent in eukaryotic genomes, and are believed to 
be deletion derivatives of DNA transposons (1,2). Like
autonomous DNA transposons, MITEs usually have
terminal inverted repeats (TIRs), flanked by short direct
repeats [also called target site duplications (TSDs)].
Compared with autonomous DNA transposons, MITEs
are often short (<800 bp) and do not encode transposases.
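
The structural features just described (TIRs, TSDs and a length below 800 bp) can be illustrated with a short Python sketch. It is not the method used by MITE Digger, MITE-Hunter or RSPB: the function names, the requirement of perfect matches, and the minimum TIR and TSD lengths (10 bp and 2 bp) are assumptions chosen only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def tir_length(element: str) -> int:
    """Length of the perfect terminal inverted repeat of an element.

    The 5' end is compared base by base with the reverse complement
    of the 3' end; counting stops at the first mismatch.
    """
    element = element.upper()
    rc = reverse_complement(element)
    n = 0
    for a, b in zip(element[: len(element) // 2], rc):
        if a != b:
            break
        n += 1
    return n


def tsd_length(left_flank: str, right_flank: str, max_len: int = 10) -> int:
    """Longest direct repeat shared by the end of the left flank and
    the start of the right flank (a perfect target site duplication)."""
    longest = min(max_len, len(left_flank), len(right_flank))
    for k in range(longest, 0, -1):
        if left_flank[-k:].upper() == right_flank[:k].upper():
            return k
    return 0


def looks_like_mite(left_flank: str, element: str, right_flank: str,
                    min_tir: int = 10, min_tsd: int = 2,
                    max_len: int = 800) -> bool:
    """Hypothetical screening rule: short element with a TIR and a TSD.
    The thresholds min_tir and min_tsd are illustrative assumptions."""
    return (len(element) < max_len
            and tir_length(element) >= min_tir
            and tsd_length(left_flank, right_flank) >= min_tsd)
```

Real elements usually carry imperfect TIRs, so a practical check would need to tolerate mismatches; the sketch only shows the geometry of the two features.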

MITEs are often located in gene-rich euchromatic
regions and are associated with genes (3,4). Several
lines of evidence suggest that MITEs may affect the
expression of nearby genes. The MITE Kiddo in rice was
shown to upregulate the expression of Ubiquitin2 when
inserted in its promoter region (5). In other cases, however,
MITE insertions downregulate the expression of nearby genes
(6,7). Such downregulation most likely acts through small
RNAs derived from MITE sequences (6,8). MITE
transpositions generate much genetic diversity within a species
(9-11). Considering the effects of MITEs on gene
expression and the variation of MITE insertions among
genotypes, MITEs may contribute to considerable phenotypic
diversity as well (12).

The first MITE families were discovered through
sequence analysis (i.e. identification of TIR and TSD
sequences) of insertions of 100-600 bp (13,14). More recently,
computer programs were developed to systematically
identify MITEs from sequence databases such as genome
sequences (6,15-19). Among them, the most successful
are MITE Digger, MITE-Hunter and RSPB, which
identified the vast majority of MITEs in the sequenced
genome of rice (6,18,19). The recently reported program
MITE Digger is the most efficient for de novo MITE
identification, particularly in large genomes (19). RSPB is
better at identifying MITE families with atypical structures,
such as MITEs with no TSD or with short or diverse TIR
sequences. Unfortunately, RSPB requires computing capacity
not found in most laboratories. We predicted that combining
MITE Digger, MITE-Hunter and RSPB would allow the
detection of the vast majority of, if not all, MITE families in
a genome, with no prior information required. With the
availability of the three MITE-detecting programs and the
genome sequences of many plant species, MITEs in
several genomes can be readily identified and compared
to further our understanding of MITE origin and
evolution.

MITEs, as repetitive sequences, have been included in other
databases such as The Institute for Genomic Research
(TIGR) Plant Repeat Databases and Repbase (20,21).
However, MITEs vary dramatically and usually cannot
be identified through homology searches between distantly
related species. Consequently, only a small proportion
of MITE families have been identified and included in
these databases. In this study, MITEs were de novo
identified from 41 plant species using the computer
programs MITE Digger, MITE-Hunter and/or RSPB.
Each MITE family was annotated manually. All verified
MITE families were stored in a database, P-MITE (for
plant MITE). A BLASTN search function was added
to the database, and MITE sequences from each genome
can be downloaded. P-MITE will be helpful for the
annotation of genes and genomic sequences. It can also be used
to study the origin and amplification of MITEs,
comparative analysis between different species,
MITE-derived small RNAs and the roles of MITEs in gene
and genome evolution.

MATERIALS AND METHODS 

Plant genomes used in this study 

Forty-one sequenced and published genomes of plant
species, including six lower plant species, were included
in this study for MITE identification. Information on
the 41 species and the Web sites for their genome
sequences are listed in Supplementary Table S1. The MITEs
from rice were identified and annotated in a previous
study (6).

De novo identification of MITEs using MITE Digger, 
MITE-Hunter and RSPB 

MITEs from 41 genomes were de novo identified using
the programs MITE Digger, MITE-Hunter and/or RSPB
(6,18,19). First, MITE-Hunter was run on
the sequences of each genome. The resulting groups of
potential MITEs were manually checked for TSD and
TIR sequences. Groups with no precise boundaries
(terminals) or no TIR sequences were not considered to be
MITEs. The confirmed MITEs from MITE-Hunter were
put into a database (the 'MITE-Hunter database'). To save
running time, RSPB was slightly modified so
that the confirmed MITE sequences in the
'MITE-Hunter database' were skipped by RSPB. New groups
of repetitive sequences with precise boundaries were
reported and checked manually for TSDs and TIRs
(Supplementary Figure S1). No TSD or TIR information
is required to run RSPB, which identifies repetitive
sequences with precise boundaries. In the subsequent manual
annotation, only repetitive sequences <800 bp with TSD/TIR
features similar to those of known MITE superfamilies were
retained. RSPB could not be run successfully on five species
with large genomes or too many short contigs. MITE
Digger, released recently, was also run on some
genomes, including genomes >800 Mb. The statistics of the
MITE families identified in this study are shown in
Supplementary Table S2. The number of MITE families
that were detected by RSPB, but not by MITE-Hunter, is
shown in Supplementary Table S3.
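
The second-stage selection described in this section can be summarized in a small Python sketch. It is only an outline under stated assumptions: the `Candidate` record, the exact-match test used to skip sequences already confirmed by MITE-Hunter, and the boolean flag standing in for the manual TSD/TIR annotation are all hypothetical, and the modified RSPB program itself is not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one group of repeats with precise boundaries."""
    name: str
    sequence: str
    # Outcome of the manual check: do the TSD/TIR features resemble
    # those of a known MITE superfamily?
    matches_known_superfamily: bool


def select_new_candidates(candidates: Iterable[Candidate],
                          confirmed_sequences: Iterable[str],
                          max_length: int = 800) -> List[Candidate]:
    """Keep candidates that are new, shorter than max_length and
    consistent with a known superfamily.

    Skipping is done by exact sequence match here, which is an
    assumption made for simplicity.
    """
    confirmed = {seq.upper() for seq in confirmed_sequences}
    kept = []
    for cand in candidates:
        if cand.sequence.upper() in confirmed:
            continue  # already confirmed in the first stage
        if len(cand.sequence) >= max_length:
            continue  # too long to be treated as a MITE
        if not cand.matches_known_superfamily:
            continue  # rejected during manual annotation
        kept.append(cand)
    return kept
```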

Classification of MITE superfamily and family 

A Perl script was written to cluster the MITEs identified
above into a family if they had significant sequence
similarity (BLASTN e < 10^-10) (6). MITE families were
assigned to superfamilies based on their TIR and TSD
sequences. Each MITE family in a genome was named
code_Abc#, where Ab is the first two letters of its genus
name, c is the first letter of its species name and # is a
consecutive number. Different superfamilies are represented
by different codes, with DTT for Tc1/Mariner, DTM for
Mutator, DTA for hAT, DTC for CACTA, DTH for
PIF/Harbinger, DTP for P, DTN for Novosib and DTx for
unknown (21-23). MITEs with ambiguous TSD and/or
TIR features were annotated as unknown superfamily
(DTx). MITE families preferentially inserted into simple
tandem repeats (microsatellites) were considered an
independent group, MiM (MITEs inserted in microsatellite).
A 'representative' element was chosen for each family; where
possible, the representative element has good TIR and
perfect TSD sequences. A MITE sequence
was considered a full-length element when its terminals
were no more than 3 bp shorter than the representative
sequence. To identify all MITE elements in a genome, including
diverse and/or partial ones, a library of all
representative elements from each family was used as
query sequences to search the entire genome sequence
using RepeatMasker v3.2.9 (http://www.repeatmasker.org/).
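
The naming convention code_Abc# can be made concrete with a few lines of Python. The superfamily codes are those listed in the text; the function name `family_name` and the dictionary layout are illustrative assumptions, and the MiM group is omitted because no code is given for it here.

```python
# Superfamily codes as listed in the text.
SUPERFAMILY_CODES = {
    "Tc1/Mariner": "DTT",
    "Mutator": "DTM",
    "hAT": "DTA",
    "CACTA": "DTC",
    "PIF/Harbinger": "DTH",
    "P": "DTP",
    "Novosib": "DTN",
    "unknown": "DTx",
}


def family_name(superfamily: str, species: str, number: int) -> str:
    """Build a family name of the form code_Abc#.

    species is a binomial such as 'Malus domestica'; number is the
    consecutive family number within that genome.
    """
    genus, epithet = species.split()[:2]
    code = SUPERFAMILY_CODES[superfamily]  # KeyError for unlisted names
    return f"{code}_{genus[:2].capitalize()}{epithet[0].lower()}{number}"
```

For example, `family_name("Mutator", "Malus domestica", 25)` gives `DTM_Mad25`, and `family_name("Tc1/Mariner", "Sorghum bicolor", 24)` gives `DTT_Sob24`, matching family names that appear in this paper.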

RESULTS AND DISCUSSION 

De novo identification of MITEs in 41 plant genomes 

Program MITE-Hunter was applied to 41 plant genomes
for genome-wide de novo identification of MITEs. RSPB
was also run on all but five genomes, which either are
>800 Mb or have too many contigs. MITE Digger was
used to search some genomes, including four skipped by
RSPB. The MITE sequences obtained in this study
were used as queries in a BLASTN search against Repbase,
the database most frequently used for repetitive sequences
(21). More than 70% of the MITE families identified in
this study were not included in Repbase (e < 10^-10). [...]
MITE-Hunter, but not RSPB, because the genome is too large.
A total of 252 MITE families were obtained from maize,
including 97 novel families not covered by the maize TE
database. However, 61 MITE families listed in the maize TE
database were not identified by either MITE Digger or
MITE-Hunter. The computing process of RSPB needs
to be improved before it can be applied to large genomes,
such as maize, to identify more novel MITE families.

The majority of MITEs were classified into five
superfamilies: Tc1/Mariner, PIF/Harbinger,
CACTA, hAT and Mutator. Two superfamilies, P and
Novosib, were detected in the genomes of lower plants,
although these genomes lack Tc1/Mariner, CACTA and
Mutator elements. Sixteen MITE families were unclassified owing
to ambiguous TSD and/or TIR features. The MiM group is the
least frequent in plant genomes (Supplementary Table S2). It
is present in only 10 of the 41 genomes, with
41 893 elements from 33 families. The strawberry genome
contains 14 MiM families, whereas the others have no
more than four MiM families each. Most elements of these
MiM families, including the Micron family in rice (24),
were inserted in (TA)n repeats, with only a few exceptions
that were inserted into (CA)n/(GT)n repeats.
Elements from the MiM group have poor TIR sequences,
and no conserved nucleotides were found in their
terminals among different families. It remains unclear
whether different MiM families belong to the same
superfamily, i.e. are activated by the same type of transposase. In
contrast to the scarce MiM group, the Mutator
superfamily has 852 390 elements in the 41 genomes included in this
study, an average of about 20 790 elements per genome.



MITEs with significant nucleotide identities (BLASTN
e < 10^-10) were grouped into a family. The largest MITE
family is DTM_Mad25 from the apple genome, with
18 904 elements. The smallest MITE families, DTT_Sob24
and DTH_Sob33 from the Sorghum genome, each have only
one element.
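
Grouping elements into families by pairwise BLASTN hits can be sketched as single-linkage clustering with a union-find structure. The Perl script used in this study is not shown, so the linkage rule, the `(query, subject, e-value)` tuple layout and the function name `cluster_families` are assumptions; only the e-value cut-off is taken from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, BLASTN e-value)


def cluster_families(element_ids: Sequence[str],
                     hits: Iterable[Hit],
                     max_evalue: float = 1e-10) -> List[List[str]]:
    """Single-linkage clustering: two elements share a family if they
    are connected by a chain of hits with e-value below max_evalue."""
    parent: Dict[str, str] = {e: e for e in element_ids}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue < max_evalue and query in parent and subject in parent:
            parent[find(query)] = find(subject)

    families: Dict[str, List[str]] = defaultdict(list)
    for e in element_ids:
        families[find(e)].append(e)
    # Largest families first; ties broken by member names.
    return sorted((sorted(m) for m in families.values()),
                  key=lambda m: (-len(m), m))
```

With elements a, b, c and d and hits a-b (1e-30), b-c (1e-12) and c-d (1e-5), the sketch returns one family of three elements and one singleton, because the c-d hit does not pass the cut-off. Singletons of this kind correspond to families with only one element.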

The number of MITEs varies dramatically among
species. In general, the genomes of lower plants have
relatively few MITEs (Table 1). No MITEs were detected in
the genome of Cyanidioschyzon merolae using either
MITE-Hunter or RSPB, and the genome of Selaginella
moellendorffii harbors only 73 MITE elements. The
number of MITEs also varies considerably among the
genomes of higher plants. For example, only one MITE
family with 538 elements was detected in the papaya
genome, whereas 237 302 elements from 180 MITE
families are present in the apple genome. Large variations
in the total number of MITE elements also occur between
closely related species. For example, the Arabidopsis
thaliana genome has only 3245 MITE elements, whereas
its close relative, Arabidopsis lyrata, contains 18 039
MITE-related sequences. Similarly, the number of
MITEs in the genome of watermelon (94 314 MITE
elements) is about seven times that in the genome of
melon (12 991 MITE elements).

Table 1. MITEs in 41 plant genomes

Species | Family | Genome size (Mb) | MITE family number | MITE element number | MITE total length (Mb) | MITE percentage in genome
--- | --- | --- | --- | --- | --- | ---
Phoenix dactylifera | Arecaceae | 381.56 | 33 | 39 990 | 8.22 | 2.15
Arabidopsis thaliana | Brassicaceae | 119.67 | 43 | 3245 | 0.85 | 0.71
Thellungiella parvula | Brassicaceae | 123.6 | 7 | 1161 | 0.32 | 0.26
Arabidopsis lyrata | Brassicaceae | 206.67 | 121 | 18 039 | 4.64 | 2.24
Thellungiella salsuginea | Brassicaceae | 208.87 | 54 | 5133 | 1.27 | 0.61
Brassica rapa | Brassicaceae | 283.84 | 174 | 45 821 | 11.49 | 4.05
Carica papaya | Caricaceae | 342.68 | 1 | 538 | 0.21 | 0.06
Chlamydomonas reinhardtii | Chlamydomonadaceae | 111.1 | 20 | 3508 | 0.99 | 0.89
Chlorella variabilis | Chlorellaceae | 46.16 | 2 | 83 | 0.04 | 0.08
Cucumis sativus | Cucurbitaceae | 203.06 | 7 | 10 810 | 2.02 | 1.00
Citrullus lanatus | Cucurbitaceae | 353.47 | 35 | 94 314 | 19.55 | 5.53
Cucumis melo | Cucurbitaceae | 431.04 | 10 | 12 991 | 2.79 | 0.65
Cyanidioschyzon merolae | Cyanidiaceae | 16.54 | 0 | 0 | 0.00 | 0.00
Jatropha curcas | Euphorbiaceae | 297.67 | 17 | 18 975 | 4.81 | 1.61
Ricinus communis | Euphorbiaceae | 350.63 | 33 | 13 205 | 3.24 | 0.93
Manihot esculenta | Euphorbiaceae | 532.53 | 21 | 30 934 | 8.94 | 1.68
Medicago truncatula | Fabaceae | 307.48 | 288 | 132 834 | 25.24 | 8.21
Lotus japonicus | Fabaceae | 316.89 | 172 | 71 811 | 14.16 | 4.47
Cajanus cajan | Fabaceae | 605.78 | 92 | 135 581 | 31.06 | 5.13
Cannabis sativa | Fabaceae | 786.64 | 53 | 110 123 | 24.06 | 3.06
Glycine max | Fabaceae | 973.34 | 126 | 169 379 | 27.69 | 2.84
Physcomitrella patens | Funariaceae | 479.99 | 4 | 3718 | 0.58 | 0.12
Linum usitatissimum | Linaceae | 318.25 | 28 | 14 409 | 3.51 | 1.10
Theobroma cacao | Malvaceae | 327.35 | 13 | 10 364 | 3.45 | 1.06
Musa acuminata | Musaceae | 472.96 | 9 | 15 835 | 2.22 | 0.47
Coccomyxa subellipsoidea | Palmellaceae | 48.95 | 4 | 187 | 0.04 | 0.09
Brachypodium distachyon | Poaceae | 271.92 | 222 | 83 272 | 12.86 | 4.73
Oryza sativa^a | Poaceae | 373.25 | 339 | 179 415 | 37.27 | 9.98
Setaria italica | Poaceae | 405.78 | 178 | 69 264 | 15.60 | 3.85
Sorghum bicolor | Poaceae | 738.58 | 275 | 112 307 | 29.63 | 4.01
Zea mays | Poaceae | 2058.58 | 252 | 192 529 | 40.36 | 1.96
Fragaria vesca | Rosaceae | 206.89 | 162 | 34 880 | 8.97 | 4.33
Malus domestica | Rosaceae | 881.28 | 180 | 237 302 | 44.63 | 5.06
Prunus persica | Rosaceae | 227.25 | 99 | 39 110 | 8.84 | 3.89
Citrus sinensis | Rutaceae | 327.94 | 106 | 46 032 | 11.35 | 3.46
Populus trichocarpa | Salicaceae | 417.14 | 22 | 35 081 | 7.49 | 1.80
Selaginella moellendorffii | Selaginellaceae | 212.76 | 1 | 73 | 0.01 | 0.01
Solanum lycopersicum | Solanaceae | 781.67 | 104 | 107 087 | 26.89 | 3.44
Solanum tuberosum | Solanaceae | 797.83 | 171 | 170 392 | 38.65 | 4.84
Vitis vinifera | Vitaceae | 486.19 | 35 | 61 065 | 14.69 | 3.02
Volvox carteri | Volvocaceae | 131.16 | 14 | 2104 | 0.62 | 0.47

^a The MITE sequences from rice were retrieved from Lu et al. (25).
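
The percentage column of Table 1 appears to be the total MITE length divided by the genome assembly size. The short Python sketch below makes that reading explicit; the function name is hypothetical, and the relationship is inferred from the table values rather than stated in the text.

```python
def mite_percentage(total_length_mb: float, genome_size_mb: float) -> float:
    """Share of the genome assembly occupied by MITEs, in percent."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return 100.0 * total_length_mb / genome_size_mb


# Values taken from Table 1: (total MITE length in Mb, genome size in Mb).
examples = {
    "Phoenix dactylifera": (8.22, 381.56),
    "Malus domestica": (44.63, 881.28),
    "Zea mays": (40.36, 2058.58),
}
percentages = {name: round(mite_percentage(length, size), 2)
               for name, (length, size) in examples.items()}
```

For these three rows the rounded results are 2.15, 5.06 and 1.96, in agreement with the table. Because the tabulated lengths are themselves rounded, other rows may differ in the last digit.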

The number of MITEs in a genome is significantly
correlated with its genome assembly size (r = 0.72,
P < 0.01; Table 1; Figure 1). A similar correlation
coefficient (r = 0.68, P < 0.01) was obtained when the six lower
plants were excluded from the analysis. Nevertheless,
several striking exceptions were observed. For example,
the rice genome is only 373 Mb but has the third largest
number (179 415) of MITEs among all species studied,
whereas papaya, with a genome size (342 Mb) similar to
that of rice, has only 538 elements of one MITE family
(Table 1).
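
A correlation coefficient of this kind can be computed as in the following Python sketch. It assumes that r denotes Pearson's product-moment coefficient, which the text does not state explicitly, and the five data pairs are a small subset of Table 1 used only to show the call, so the result will not reproduce the reported r = 0.72.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Genome size (Mb) and MITE element number for five species in Table 1:
# A. thaliana, papaya, rice, apple and maize.
genome_size = [119.67, 342.68, 373.25, 881.28, 2058.58]
mite_count = [3245, 538, 179415, 237302, 192529]
r_subset = pearson_r(genome_size, mite_count)
```

Applying the same function to all 41 rows of Table 1 would be the natural way to check the reported coefficient.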

The construction and the use of plant MITE database, 
P-MITE 

A total of 2.3 million sequences of 3527 MITE families
were obtained from 41 plant genomes (including the rice
genome). A series of databases containing MITEs from
the 41 plant genomes was constructed. Elements from
each of the 3527 MITE families were checked and
annotated manually, and one element with better TSD
and/or TIR features was chosen as a representative of
the family. A database containing all representative
elements was constructed, which can be used to study
the structure of MITEs, such as their TSD and TIR
features.

The aforementioned databases are collectively named
P-MITE (for plant MITE) and can be found at
http://pmite.hzau.edu.cn/django/mite. The database is searchable
using the BLASTN algorithm. MITE sequences and representative
elements can be downloaded by family or by genome.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online,
including [26-66].